Abstract

We introduce a novel set of features for robust object
recognition. Each element of this set is a complex feature
obtained by combining position- and scale-tolerant edge-detectors
over neighboring positions and multiple orientations. Our system's
architecture is motivated by a quantitative model of visual cortex.

We show that our approach exhibits excellent recognition
performance and outperforms several state-of-the-art systems on a
variety of image datasets including many different object
categories. We also demonstrate that our system is able to learn
from very few examples. The performance of the approach
constitutes a suggestive plausibility proof for a class of
feedforward models of object recognition in cortex.

1  Introduction 

Hierarchical approaches to generic object recognition have become
increasingly popular over the years. These are in some cases
inspired by the hierarchical nature of primate visual cortex
[10, 25], but, most importantly, hierarchical approaches have been
shown to consistently outperform flat single-template (holistic)
object recognition systems on a variety of object recognition
tasks [7, 10]. Recognition typically involves the computation of a
set of target features (also called components [7], parts [24] or
fragments [22]) at one step and their combination in the next
step. Features usually fall in one of two categories:
template-based or histogram-based. Several template-based methods
exhibit excellent performance in the detection of a single object
category, e.g., faces [17, 23], cars [17] or pedestrians [14].
Constellation models based on generative methods perform well in
the recognition of several object categories [24, 4], particularly
when trained with very few training examples [3]. One limitation
of these rigid template-based features is that they might not
adequately capture variations in object appearance: they are very
selective for a target shape but lack invariance with respect to
object transformations. At the other extreme, histogram-based
descriptors [12, 2] are very robust with respect to object
transformations. The SIFT-based features [12], for instance, have
been shown to excel in the re-detection of a previously seen
object under new image transformations. However, as we confirm
experimentally (see section 4), with such a degree of invariance,
it is unlikely that the SIFT-based features could perform well on
a generic object recognition task.

In this paper, we introduce a new set of biologically-inspired
features that exhibit a better trade-off between invariance and
selectivity than template-based or histogram-based approaches.
Each element of this set is a feature obtained by combining the
response of local edge-detectors that are slightly position- and
scale-tolerant over neighboring positions and multiple
orientations (like complex cells in primary visual cortex). Our
features are more flexible than template-based approaches [7, 22]
because they allow for small distortions of the input; they are
more selective than histogram-based descriptors as they preserve
local feature geometry. Our approach is as follows: for an input
image, we first compute a set of features learned from the
positive training set (see section 2). We then run a standard
classifier on the vector of features obtained from the input
image. The resulting approach is simpler than the aforementioned
hierarchical approaches: it does not involve scanning over all
positions and scales, it uses discriminative methods and it does
not explicitly model object geometry. Yet it is able to learn from
very few examples and it performs significantly better than all
the systems we have compared it with thus far.
Band $\Sigma$   filter sizes s   $\sigma$        $\lambda$       grid size $N^\Sigma$
1               7 & 9            2.8 & 3.6       3.5 & 4.6       8
2               11 & 13          4.5 & 5.4       5.6 & 6.8       10
3               15 & 17          6.3 & 7.3       7.9 & 9.1       12
4               19 & 21          8.2 & 9.2       10.3 & 11.5     14
5               23 & 25          10.2 & 11.3     12.7 & 14.1     16
6               27 & 29          12.3 & 13.4     15.4 & 16.8     18
7               31 & 33          14.6 & 15.8     18.2 & 19.7     20
8               35 & 37          17.0 & 18.2     21.2 & 22.8     22

orientations $\theta$:   $0$; $\pi/4$; $\pi/2$; $3\pi/4$
patch sizes $n_i$:       4 × 4; 8 × 8; 12 × 12; 16 × 16 (×4 orientations)

Table 1. Summary of parameters used in our implementation (see Fig. 1 and accompanying text).


Biological visual systems as guides. Because humans and primates
outperform the best machine vision systems by almost any measure,
building a system that emulates object recognition in cortex has
always been an attractive idea. However, for the most part, the
use of visual neuroscience in computer vision has been limited to
a justification of Gabor filters. No real attention has been given
to biologically plausible features of higher complexity. While
mainstream computer vision has always been inspired and challenged
by human vision, it seems never to have advanced past the first
stage of processing in the simple cells of primary visual cortex
V1. Models of biological vision [5, 13, 16, 1] have not been
extended to deal with real-world object recognition tasks (e.g.,
large scale natural image databases) while computer vision systems
that are closer to biology like LeNet [10] are still lacking
agreement with physiology (e.g., mapping from network layers to
cortical visual areas). This work is an attempt to bridge the gap
between computer vision and neuroscience.

Our system follows the standard model of object recognition in
primate cortex [16], which summarizes in a quantitative way what
most visual neuroscientists agree on: the first few hundred
milliseconds of visual processing in primate cortex follow a
mostly feedforward hierarchy. At each stage, the receptive fields
of neurons (i.e., the part of the visual field that could
potentially elicit a neuron's response) tend to get larger along
with the complexity of their optimal stimuli (i.e., the set of
stimuli that elicit a neuron's response). In its simplest version,
the standard model consists of four layers of computational units
where simple S units, which combine their inputs with
Gaussian-like tuning to increase object selectivity, alternate
with complex C units, which pool their inputs through a maximum
operation, thereby introducing gradual invariance to scale and
translation. The model has been able to quantitatively duplicate
the generalization properties exhibited by neurons in
inferotemporal monkey cortex (the so-called view-tuned units) that
remain highly selective for particular objects (a face, a hand, a
toilet brush) while being invariant to ranges of scales and
positions. The model originally used a very simple static
dictionary of features (for the recognition of segmented objects)
although it was suggested in [16] that features in intermediate
layers should instead be learned from visual experience.


We extend the standard model and show how it can learn a
vocabulary of visual features from natural images. We prove that
the extended model can robustly handle the recognition of many
object categories and compete with state-of-the-art object
recognition systems. This work appeared in a very preliminary form
in [18]. Our source code as well as an extended version of this
paper [20] can be found at http://cbcl.mit.edu/software-datasets.


2  The  C2  features 


Our approach is summarized in Fig. 1: the first two layers
correspond to primate primary visual cortex, V1, i.e., the first
visual cortical stage, which contains simple (S1) and complex (C1)
cells [8]. The S1 responses are obtained by applying to the input
image a battery of Gabor filters, which can be described by the
following equation:


$$G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \times \cos\left(\frac{2\pi}{\lambda} X\right),$$

where $X = x\cos\theta + y\sin\theta$ and $Y = -x\sin\theta + y\cos\theta$.

We adjusted the filter parameters, i.e., orientation $\theta$,
effective width $\sigma$, and wavelength $\lambda$, so that the
tuning profiles of S1 units match those of V1 parafoveal simple
cells. This was done by first sampling the space of parameters and
then generating a large number of filters. We applied those
filters to stimuli commonly used to probe V1 neurons [8] (i.e.,
gratings, bars and edges). After removing filters that were
incompatible with biological cells [8], we were left with a final
set of 16 filters at 4 orientations (see Table 1 and [19] for a
full description of how those filters were obtained).
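As an illustration, the Gabor equation of this section can be sketched in a few lines of NumPy. This is a minimal sketch rather than the released implementation: the function name `gabor_filter`, the aspect ratio `gamma = 0.3`, and the zero-mean, unit-norm normalization are assumptions not stated in the text; the sizes, $\sigma$ and $\lambda$ values are those of band 1 in Table 1.

```python
import numpy as np

def gabor_filter(size, sigma, lam, theta, gamma=0.3):
    """Return a size x size Gabor patch G(x, y).

    gamma (aspect ratio) defaults to an assumed value; the text does not give it.
    """
    half = size // 2
    # Pixel coordinates centred on the middle of the patch.
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates by the preferred orientation theta.
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(X**2 + gamma**2 * Y**2) / (2.0 * sigma**2))
    g = g * np.cos(2.0 * np.pi * X / lam)
    # Assumed normalisation (not from the text): zero mean, unit norm.
    g -= g.mean()
    g /= np.linalg.norm(g)
    return g

# Band 1 of Table 1: (filter size, sigma, lambda) at the four orientations.
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
band1 = [(7, 2.8, 3.5), (9, 3.6, 4.6)]
bank = [gabor_filter(s, sig, lam, th) for (s, sig, lam) in band1 for th in thetas]
```

Repeating the list comprehension over all eight rows of Table 1 would give the 16 × 4 = 64 filters whose responses form the S1 maps.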

The next stage, C1, corresponds to complex cells which show some
tolerance to shift and size: complex cells tend to have larger
receptive fields (twice as large as simple cells), respond to
oriented bars or edges anywhere within their receptive field [8]
(shift invariance) and are in general more broadly tuned to
spatial frequency than simple cells [8] (scale invariance).
Modifying the original Hubel & Wiesel proposal for building
complex cells from simple cells through pooling [8], Riesenhuber &
Poggio proposed a max-like pooling operation for building
position- and scale-tolerant C1 units. In the meantime,
experimental evidence in favor of the max operation has appeared
[6, 9]. Again pooling parameters were set so that C1 units match
the tuning properties of complex cells as measured experimentally
(see Table 1 and [19] for a full description of how those
parameters were obtained).

Given an input image I, perform the following steps:

S1: Apply a battery of Gabor filters to the input image. The
filters come in 4 orientations $\theta$ and 16 scales s (see
Table 1). Obtain 16 × 4 = 64 maps $(S1)^s_\theta$ that are
arranged in 8 bands (e.g., band 1 contains filter outputs of size
7 and 9, in all four orientations, band 2 contains filter outputs
of size 11 and 13, etc).

C1: For each band, take the max over scales and positions: each
band member is sub-sampled by taking the max over a grid with
cells of size $N^\Sigma$ first and the max between the two scale
members second, e.g., for band 1, a spatial max is taken over an
8 × 8 grid first and then across the two scales (size 7 and 9).
Note that we do not take a max over different orientations, hence,
each band $(C1)^\Sigma$ contains 4 maps.

During training only: Extract K patches $P_{i=1,\dots,K}$ of
various sizes $n_i \times n_i$ and all four orientations (thus
containing $n_i \times n_i \times 4$ elements) at random from the
$(C1)^\Sigma$ maps from all training images.

S2: For each C1 image $(C1)^\Sigma$, compute
$Y = \exp(-\gamma \|X - P_i\|^2)$ for all image patches X (at all
positions) and each patch P learned during training for each band
independently. Obtain S2 maps $(S2)^\Sigma_i$.

C2: Compute the max over all positions and scales for each S2 map
type $(S2)_i$ (i.e., corresponding to a particular patch $P_i$)
and obtain shift- and scale-invariant C2 features $(C2)_i$, for
$i = 1 \dots K$.

Figure 1. Computation of C2 features.

Fig. 2 illustrates how pooling from S1 to C1 is done. S1 units
come in 16 scales s arranged in 8 bands $\Sigma$. For instance,
consider the first band $\Sigma = 1$. For each orientation, it
contains two S1 maps: one obtained using a filter of size 7, and
one obtained using a filter of size 9. Note that both of these S1
maps have the same dimensions. In order to obtain the C1
responses, these maps are sub-sampled using a grid cell of size
$N^\Sigma \times N^\Sigma = 8 \times 8$. From each grid cell we
obtain one measurement by taking the maximum of all 64 elements.
As a last stage we take a max over the two scales, by considering
for each cell the maximum value from the two maps. This process is
repeated independently for each of the four orientations and each
scale band.
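The pooling just described can be sketched as follows for one orientation of one band. This is a hedged illustration: the function name `c1_pool` is hypothetical, and the sketch assumes non-overlapping grid cells and crops any border that does not fill a whole cell, two details the text does not specify.

```python
import numpy as np

def c1_pool(s1_a, s1_b, grid):
    """Pool two same-sized S1 maps (same orientation, the two scales of a band)."""
    h, w = s1_a.shape
    # Assumption of this sketch: crop so the map tiles exactly into grid x grid cells.
    h, w = h - h % grid, w - w % grid

    def spatial_max(m):
        # Split the map into (grid x grid) cells and keep the max of each cell.
        cells = m[:h, :w].reshape(h // grid, grid, w // grid, grid)
        return cells.max(axis=(1, 3))

    # Spatial max first, then max across the two scales.
    return np.maximum(spatial_max(s1_a), spatial_max(s1_b))
```

For band 1, `grid` would be 8, so each output value summarizes 64 positions at two filter sizes. Calling the function once per orientation yields the 4 maps of that band.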

Figure 2. Scale- and position-tolerance at the complex cells (C1)
level: Each C1 unit receives inputs from S1 units at the same
preferred orientation arranged in bands $\Sigma$, i.e., S1 units
in two different sizes and neighboring positions (grid cell of
size $N^\Sigma \times N^\Sigma$). From each grid cell (left) we
obtain one measurement by taking the max over all positions
allowing the C1 unit to respond to a horizontal edge anywhere
within the grid (tolerance to shift). Similarly, by taking a max
over the two sizes (right) the C1 unit becomes tolerant to slight
changes in scale.

In our new version of the standard model the subsequent S2 stage
is where learning occurs. A large pool of K patches of various
sizes at random positions are extracted from a target set of
images at the C1 level for all orientations, i.e., a patch $P_i$
of size $n_i \times n_i$ contains $n_i \times n_i \times 4$
elements, where the 4 factor corresponds to the four possible S1
and C1 orientations. In our simulations we used patches of size
$n_i$ = 4, 8, 12 and 16 but in practice any size can be
considered. The training process ends by setting each of those
patches as prototypes or centers of the S2 units which behave as
radial basis function (RBF) units during recognition, i.e., each
S2 unit response depends in a Gaussian-like way on the Euclidean
distance between a new input patch (at a particular location and
scale) and the stored prototype. This is consistent with
well-known neuron response properties in primate inferotemporal
cortex and seems to be the key property for learning to generalize
in the visual and motor systems [15]. When a new input is
presented, each stored S2 unit is convolved with the new
$(C1)^\Sigma$ input image at all scales, i.e., its RBF response is
evaluated at every position (this leads to K × 8 $(S2)^\Sigma_i$
images, where the K factor corresponds to the K patches extracted
during learning and the 8 factor, to the 8 scale bands). After
taking a final max for each $(S2)_i$ map across all scales and
positions, we get the final set of K shift- and scale-invariant C2
units. The size of our final C2 feature vector thus depends only
on the number of patches extracted during learning and not on the
input image size. This C2 feature vector is passed to a classifier
for final analysis.¹
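The S2 and C2 stages can be sketched as below: each stored patch is compared with every same-sized window of the C1 maps through a Gaussian of the squared Euclidean distance, and the maximum over all positions and bands is kept. This is an illustrative sketch, not the authors' code: the names `c2_feature` and `c2_vector`, the array layout (height, width, 4 orientations) and the sharpness value `gamma = 1.0` are assumptions.

```python
import numpy as np

def c2_feature(c1_bands, patch, gamma=1.0):
    """Max RBF (S2) response of one stored patch over all positions and bands.

    c1_bands: list of arrays of shape (H, W, 4), one per scale band.
    patch:    array of shape (n, n, 4) sampled from C1 maps during training.
    gamma:    RBF sharpness; the value used here is an assumption.
    """
    n = patch.shape[0]
    best = 0.0  # S2 responses lie in (0, 1], so 0 is a safe lower bound
    for c1 in c1_bands:
        H, W, _ = c1.shape
        for r in range(H - n + 1):
            for c in range(W - n + 1):
                X = c1[r:r + n, c:c + n, :]
                s2 = np.exp(-gamma * np.sum((X - patch) ** 2))
                best = max(best, s2)
    return best

def c2_vector(c1_bands, patches, gamma=1.0):
    """One C2 value per stored patch: length K, whatever the image size."""
    return np.array([c2_feature(c1_bands, p, gamma) for p in patches])
```

The explicit loops favor readability over speed. The output has one entry per stored patch, which is why the feature vector length does not depend on the input image size.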

An important question for both neuroscience and computer vision
regards the choice of the unlabeled target set from which to
learn, in an unsupervised way, this vocabulary of visual features.
In this paper, features are learned from the positive training set
for each object category (but see [20] for a discussion on how
features could be learned from random natural images).

¹ It is likely that our (non-biological) final classifier could correspond
to the task-specific circuits found in prefrontal cortex (PFC) and the C2
units to neurons in inferotemporal (IT) cortex [16]. The S2 units could be
located in V4 and/or in posterior inferotemporal (PIT) cortex.
Figure 3. Examples from the MIT face and car datasets.


3  Experimental  Setup

We tested our system on various object categorization tasks for
comparison with benchmark computer vision systems. All datasets we
used are made up of images that either contain or do not contain a
single instance of the target object; the system has to decide
whether the target object is present or absent.

MIT-CBCL datasets: These include a near-frontal (±30°) face
dataset for comparison with the component-based system of Heisele
et al. [7] and a multi-view car dataset for comparison with [11].
These two datasets are very challenging (see typical examples in
Fig. 3). The face patterns used for testing constitute a subset of
the CMU PIE database which contains a large variety of faces under
extreme illumination conditions (see [7]). The test non-face
patterns were selected by a low-resolution LDA classifier as the
most similar to faces (the LDA classifier was trained on an
independent 19 × 19 low-resolution training set). The full set
used in [7] contains 6,900 positive and 13,700 negative 70 × 70
images for training and 427 positive and 5,000 negative images for
testing. The car database on the other hand was created by taking
street scene pictures in the Boston city area. Numerous vehicles
(including SUVs, trucks, buses, etc) photographed from different
view-points were manually labeled from those images to form a
positive set. Random image patterns at various scales that were
not labeled as vehicles were extracted and used as the negative
set. The car dataset used in [11] contains 4,000 positive and
1,600 negative 120 × 120 training examples and 3,400 test examples
(half positive, half negative). While we tested our system on the
full test sets, we considered a random subset of the positive and
negative training sets containing only 500 images each for both
the face and the car database.

The Caltech datasets: The Caltech datasets contain 101 objects
plus a background category (used as the negative set) and are
available at http://www.vision.caltech.edu. For each object
category, the system was trained with n = 1, 3, 6, 15, 30 or 40
positive examples from the target object class (as in [3]) and 50
negative examples from the background class. From the remaining
images, we extracted 50 images from the positive and 50 images
from the negative set to test the system's performance. As in [3],
the system's performance was averaged over 10 random splits for
each object category. All images were normalized to 140 pixels in
height (width was rescaled accordingly so that the image aspect
ratio was preserved) and converted to gray values before
processing. These datasets contain the target object embedded in a
large amount of clutter and the challenge is to learn from
unsegmented images and discover the target object class
automatically. For a close comparison with the system by Fergus et
al. we also tested our approach on a subset of the 101-object
dataset using the exact same split as in [4] (the results are
reported in Table 2) and an additional leaf database as in [24]
for a total of five datasets that we refer to as the Caltech
datasets in the following.

Datasets             Bench.          C2 features
                                     boost     SVM
Leaves (Calt.)       [24]   84.0     97.0      95.9
Cars (Calt.)         [4]    84.8     99.7      99.8
Faces (Calt.)        [4]    96.4     98.2      98.1
Airplanes (Calt.)    [4]    94.0     96.7      94.9
Moto. (Calt.)        [4]    95.0     98.0      97.4
Faces (MIT)          [7]    90.4     95.9      95.3
Cars (MIT)           [11]   75.4     95.1      93.3

Table 2. C2 features vs. other recognition systems (Bench.).

4  Results 

Table 2 contains a summary of the performance of the C2 features
when used as input to a linear SVM and to gentle AdaBoost (denoted
boost) on various datasets. For both our system and the
benchmarks, we report performance at the equilibrium point, i.e.,
the operating point at which the false positive rate equals the
miss rate; the entries of Table 2 are percent correct at that
point. Results obtained with the C2 features are consistently
higher than those previously reported on the Caltech datasets. Our
system seems to outperform the component-based system presented in
[7] (also using SVM) on the MIT-CBCL face database as well as a
fragment-based system implemented by [11] that uses template-based
features with gentle AdaBoost (similar to [21]).
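The equilibrium point can be located from classifier scores by sweeping a decision threshold until the miss rate and the false positive rate meet. The sketch below is one simple way to do this, assuming higher scores mean "object present"; the function name `equilibrium_point` and the choice of candidate thresholds are illustrative, not taken from the paper.

```python
import numpy as np

def equilibrium_point(scores_pos, scores_neg):
    """Return (threshold, error) where miss rate and false positive rate are closest."""
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    best = None
    # Candidate thresholds: every observed score (an assumption of this sketch).
    for t in np.unique(np.concatenate([scores_pos, scores_neg])):
        miss = np.mean(scores_pos < t)    # positives rejected at threshold t
        fpr = np.mean(scores_neg >= t)    # negatives accepted at threshold t
        gap = abs(miss - fpr)
        if best is None or gap < best[0]:
            best = (gap, t, 0.5 * (miss + fpr))
    return best[1], best[2]

# Toy scores: one positive and one negative are on the wrong side.
thr, err = equilibrium_point([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

On a finite test set the two rates may never be exactly equal, so the sketch returns their average at the closest threshold. The percentages in Table 2 appear to correspond to 100 × (1 − error) at this point.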

Fig. 4 summarizes the system performance on the 101-object
database. On the left we show the results obtained using our
system with gentle AdaBoost (we found qualitatively similar
results with a linear SVM) over all 101 categories for 1, 3, 6,
15, 30 and 40 positive training examples (each result is an
average of 10 different random splits). Each plot is a single
histogram of all 101 scores, obtained using a fixed number of
training examples (e.g., with 40 examples the system gets 95%
correct for 42% of the object categories). On the right we focus
on some of the same object categories as the ones used by Fei-Fei
et al. for illustration in [3]: the C2 features achieve error
rates very similar to the ones reported in [3] with very few
training examples.

Figure 4. C2 features performance on the 101-object database for
different numbers of positive training examples: (left) histogram
across the 101 categories and (right) performance on sample
categories, see accompanying text.

Figure 5. Superiority of the C2 vs. SIFT-based features on the
Caltech datasets for different number of features (left) and on
the 101-object database for different number of training examples
(right).

We also compared our C2 features to SIFT-based features [12]. We
selected 1000 random reference key-points from the training set.
Given a new image, we measured the minimum distance between all
its key-points and the 1000 reference key-points, thus obtaining a
feature vector of size 1000 (for this comparison we did not use
the position information recovered by the algorithm). While Lowe
recommends using the ratio of the distances between the nearest
and the second closest key-point as a similarity measure, we found
that the minimum distance leads to better performance than the
ratio on these datasets. A comparison between the C2 features and
the SIFT-based features (both passed to a gentle AdaBoost
classifier) is shown in Fig. 5 (left) for the Caltech datasets.
The gain in performance obtained by using the C2 features relative
to the SIFT-based features is clear. This is true with gentle
AdaBoost (used for classification in Fig. 5, left), but we also
found very similar results with SVM. Also, as one can see in
Fig. 5 (right), the performance of the C2 features (error at
equilibrium point) for each category from the 101-object database
is well above that of the SIFT-based features for any number of
training examples.
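The SIFT-based baseline can be sketched as follows, assuming key-point descriptors have already been extracted by some SIFT implementation (not shown). The function name `min_distance_features` is hypothetical; the sketch only illustrates the "minimum distance to each reference key-point" step described in the text.

```python
import numpy as np

def min_distance_features(image_desc, reference_desc):
    """For each reference descriptor, distance to the closest descriptor in the image.

    image_desc:     array (m, d), descriptors found in the new image.
    reference_desc: array (R, d), reference descriptors from the training set.
    Returns an array of length R (R = 1000 in the experiment described above).
    """
    image_desc = np.asarray(image_desc, dtype=float)
    reference_desc = np.asarray(reference_desc, dtype=float)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, shape (R, m).
    sq = ((reference_desc ** 2).sum(axis=1)[:, None]
          + (image_desc ** 2).sum(axis=1)[None, :]
          - 2.0 * reference_desc @ image_desc.T)
    # Clip tiny negative values caused by rounding before the square root.
    dist = np.sqrt(np.maximum(sq, 0.0))
    return dist.min(axis=1)
```

As with the C2 vector, the output length depends only on the number of reference key-points, so the same classifiers can be trained on either representation.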

Finally, we conducted initial experiments on the multi-class case.
For this task we used the 101-object dataset. We split each
category into a training set of size 15 or 30 and a test set
containing the rest of the images. We used a simple multi-class
linear SVM as classifier. The SVM applied the all-pairs method for
multiple label classification, and was trained on 102 labels (101
categories plus the background category, i.e., a 102-alternative
forced-choice task). The number of C2 features used in these
experiments was 4075. We obtained a correct classification rate
above 35% when using 15 training examples per class, averaged over
10 repetitions, and 42% when using 30 training examples (chance is
below 1%).
Figure 6. (left) Sample features learned from different object
categories (i.e., first 5 features returned by gentle AdaBoost for
each category; panels: Motorbikes, Faces, Airplanes, Starfish, Yin
yang). Shown are S2 features (centers of RBF units): each oriented
ellipse characterizes a C1 (afferent) subunit at matching
orientation, while color encodes response strength. (right)
Multiclass classification on the 101-object database with a linear
SVM.


5  Discussion 

This paper describes a new biologically-motivated framework for
robust object recognition: our system first computes a set of
scale- and translation-invariant C2 features from a training set
of images and then runs a standard discriminative classifier on
the vector of features obtained from the input image. Our approach
exhibits excellent performance on a variety of image datasets and
competes with some of the best existing systems.

This system belongs to a family of feedforward models of object
recognition in cortex that have been shown to be able to duplicate
the tuning properties of neurons in several visual cortical areas.
In particular, Riesenhuber & Poggio showed that such a class of
models accounts quantitatively for the tuning properties of
view-tuned units in inferotemporal cortex (tested with idealized
object stimuli on uniform backgrounds), which respond to images of
the learned object more strongly than to distractor objects,
despite significant changes in position and size [16]. The
performance of this architecture on a variety of real-world object
recognition tasks (presence of clutter and changes in appearance,
illumination, etc) provides another compelling plausibility proof
for this class of models.

While a long-time goal for computer vision has been to build a
system that achieves human-level recognition performance,
state-of-the-art algorithms have been diverging from biology: for
instance, some of the best existing systems use geometrical
information about the constitutive parts of objects (constellation
approaches rely on both appearance-based and shape-based models
and component-based systems use the relative position of the
detected components along with their associated detection values).
Biology is however unlikely to be able to use geometrical
information, at least in the cortical stream dedicated to shape
processing and object recognition. The system described in this
paper respects the properties of cortical processing (including
the absence of geometrical information) while showing performance
at least comparable to the best computer vision systems.


The fact that this biologically-motivated model outperforms more
complex computer vision systems might at first appear puzzling.
The architecture performs only two major kinds of computations
(template matching and max pooling) while some of the other
systems we have discussed involve complex computations like the
estimation of probability distributions [24, 4, 3] or the
selection of facial components for use by an SVM [7]. Perhaps part
of the model's strength comes from its built-in gradual shift- and
scale-tolerance that closely mimics visual cortical processing,
which has been finely tuned by evolution over millions of years.
It is also very likely that such hierarchical architectures ease
the recognition problem by decomposing the task into several
simpler ones at each layer. Finally it is worth pointing out that
the set of C2 features that is passed to the final classifier is
very redundant, probably more redundant than for other approaches.
While we showed that a relatively small number of features (about
50) is sufficient to achieve good error rates, performance can be
increased significantly by adding many more features.
Interestingly, the number of features needed to reach the ceiling
(about 5,000 features) is much larger than the number used by
current systems (on the order of 10-100 for [22, 7, 21] and 4-8
for constellation approaches [24, 4, 3]).